by Jeff McDonald, Computational Analysis Applications Manager
This article first reviews the shared memory programming model, with pointers to introductory material, to help you add parallel capability to your application software. It then examines the dynamic threads capability available to MIPSpro™ FORTRAN parallel applications under IRIX™ 6.0.2 or later. Dynamic threads enable a parallel program to use a varying number of processors during a run, based on system load. By using this simple feature, you can dramatically improve total system throughput in a production environment with high load averages. To quantify this throughput advantage, I present results from a simple throughput experiment.
The next section provides background information on the MIPSpro FORTRAN parallel programming model. If you are already familiar with this model, you can skip to the Dynamic Threads section.
Parallel processing enables application performance to exceed the performance available from a single processor. When turnaround time is important to the user of your application, you might want to invest in parallelization, especially with the recent advent of affordable and powerful multiprocessor systems like the POWER CHALLENGE™ 10000. The shared memory parallel programming model supported on Silicon Graphics® multiprocessor systems (see Reference 1) makes it easy for you to add parallel capability to your application incrementally, because it enables all processors participating in a computation to directly address all data associated with the job.
The primary advantage of the shared memory model over distributed memory, message-passing models, such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM), is its ease of use and the ability to parallelize applications incrementally and partially. A distributed memory model forces you to manage explicitly the distribution and communication of data across processors, typically resulting in code changes that are both extensive and pervasive. This important advantage of the shared memory approach over other paradigms is often overlooked. To the application programmer, it means parallelization is not an all-or-nothing proposition; rather, it can begin in modest fashion and proceed with additional parallel content over time and application releases, all with minimal source-code changes.
For an introduction to parallel programming using this model, I recommend Practical Parallel Programming (see Reference 2). The basic parallel construct in MIPSpro FORTRAN is at the loop level: an ordinary DO loop, modified by a C$DOACROSS directive immediately preceding it. In this execution model, processing proceeds on a single master thread except within DOACROSS loops; on entering such a loop, additional slave threads join the computation. The total number of threads is typically set before program execution by the environment variable MP_SET_NUMTHREADS, or explicitly at run time by one or more calls to the mp_set_numthreads() routine. If you run a parallelized program without explicitly setting the number of threads, the program uses all of the processors in the host system, up to a maximum of eight.
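As a minimal sketch (the program, array sizes, and values here are illustrative, not taken from the experiment), the following adds two vectors in parallel:

      PROGRAM VECADD
      INTEGER N, I
      PARAMETER (N = 1000000)
      REAL A(N), B(N), C(N)
C     Request four threads at run time; this could equally be done
C     before execution with the MP_SET_NUMTHREADS environment variable.
      CALL MP_SET_NUMTHREADS(4)
C$DOACROSS LOCAL(I)
      DO 5 I = 1, N
         B(I) = 2.0
         C(I) = 1.0
    5 CONTINUE
C     Each iteration is independent, so the loop is safe to parallelize;
C     I is private to each thread, and the arrays are shared.
C$DOACROSS LOCAL(I)
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
   10 CONTINUE
      WRITE (*,*) 'A(1) = ', A(1)
      END

Compiled with the MIPSpro compiler's multiprocessing option (-mp), the iterations of each C$DOACROSS loop are divided among the available threads, and execution returns to the master thread when the loop completes.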
The objective of parallelization is to achieve higher performance levels. In production environments, it is also of interest to maximize total system throughput. Running parallel jobs on loaded systems, such that the resulting load average exceeds the number of processors in the system, actually reduces overall system throughput. Dynamic threads, described in the next section, is a simple-to-use mechanism that helps maximize system throughput by changing the number of processors used by parallel jobs in response to system load conditions measured at run time.
The dynamic threads capability is enabled by the environment variable MP_SUGNUMTHD. When this variable is set, the run-time system treats the value of MP_SET_NUMTHREADS (or the default, if that variable is not explicitly set) as a suggestion only. The presence of MP_SUGNUMTHD causes the run-time library to create an additional, asynchronous process that is activated approximately every 3 seconds (in IRIX 6.2) to monitor total system load. When idle processors exist, the library increases the number of threads available to the job, up to a maximum of MP_SET_NUMTHREADS. When the system load increases, it decreases the number of threads available to the job, possibly to as few as one. When MP_SUGNUMTHD is not set, dynamic threads is disabled and multithreading works as before.
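For example, from the C shell you might launch a job this way (the program name a.out is illustrative; any value of MP_SUGNUMTHD enables the feature):

    setenv MP_SET_NUMTHREADS 4    # suggested maximum of four threads
    setenv MP_SUGNUMTHD on        # treat that number as a suggestion only
    a.out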
Another environment variable that I explore in the simple tests in the next section is MP_BLOCKTIME. This variable controls the behavior of slave processes that are idle while the master thread executes sequential code regions. Each slave process busy-waits (by executing an empty loop) for MP_BLOCKTIME iterations before suspending itself and yielding the processor to another process. The default value for MP_BLOCKTIME is currently 100,000,000 in IRIX 6.2; MP_BLOCKTIME=0 indicates that the slave threads should never block; and MP_BLOCKTIME=1 causes the slave threads to block immediately after each parallel loop. Blocking results in higher startup overhead on subsequent parallel loops, but it immediately frees the processors used by slave threads for use by other processes or threads. You can view a busy wait as wasting CPU time in exchange for better job performance on unloaded systems.
To illustrate the effect on system throughput of dynamic threads and of setting MP_BLOCKTIME to block slaves immediately, I ran various job mixes consisting of combinations of four different jobs. I built each job to exhibit a distinct combination of run time and parallel efficiency. Each of the four jobs is parallelized to some degree and is set up to request four processors by setting MP_SET_NUMTHREADS to 4. All the parallel jobs in each mix of four are run simultaneously on a 4-processor POWER CHALLENGE 10000. The total number of processors requested in each job mix, therefore, is 16, oversubscribing the number of available processors by a factor of four.
All four jobs are variants of the same small (3,500-line) computational fluid dynamics code. The computational requirements of each job, on both one and four processors, are described in Table 1, along with the percentage of total work done by each job in parallel loops. Besides these jobs, an additional small load was present on the test system due to network backup activity and an expected small amount of operating system overhead. The two jobs with the long_ prefix run about two times as long as the others, and the _pp suffix denotes the two less-parallelized jobs.
Table 1. Characteristics of the four test jobs.

| Job Name      | Single 1-CPU Job Elapsed Time (sec) | Single 4-CPU Job Elapsed Time (sec) | Percentage of Work Parallelized |
| fluid         | 74                                  | 30                                  | 79%                             |
| long_fluid    | 144                                 | 55                                  | 82%                             |
| fluid_pp      | 73                                  | 52                                  | 38%                             |
| long_fluid_pp | 143                                 | 94                                  | 47%                             |
I ran seven job mixes, as described in Table 2, under three run-time environments. For the default static thread results, I set the MP_SET_NUMTHREADS environment variable to 4. For the second set of static thread results, I again set MP_SET_NUMTHREADS to 4 and also set MP_BLOCKTIME to 1. Finally, I obtained the dynamic thread results by setting MP_SET_NUMTHREADS to 4 and MP_SUGNUMTHD to any value. (The three environments are sketched after Table 2.)
Table 2. Composition of the seven job mixes.

| Job Mix # | # of fluid jobs | # of long_fluid jobs | # of fluid_pp jobs | # of long_fluid_pp jobs |
1 | 4 | 0 | 0 | 0 |
2 | 2 | 2 | 0 | 0 |
3 | 3 | 1 | 0 | 0 |
4 | 0 | 0 | 4 | 0 |
5 | 0 | 0 | 2 | 2 |
6 | 0 | 0 | 3 | 1 |
7 | 2 | 0 | 2 | 0 |
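For reference, the three run-time environments correspond to C shell settings like these (the value on for MP_SUGNUMTHD is illustrative, since any value enables it):

    # Environment 1: default static threads
    setenv MP_SET_NUMTHREADS 4

    # Environment 2: static threads, slaves block immediately
    setenv MP_SET_NUMTHREADS 4
    setenv MP_BLOCKTIME 1

    # Environment 3: dynamic threads
    setenv MP_SET_NUMTHREADS 4
    setenv MP_SUGNUMTHD on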
Table 3 lists the total CPU time (user plus system time for all jobs in the mix) and the elapsed time to complete each job mix under each run-time environment. The table also lists the savings, in both CPU time and elapsed time, of dynamic threads over default static threads, and of static threads with MP_BLOCKTIME=1 over default static threads.
Table 3. Total CPU time and elapsed time for each job mix under each run-time environment.

| Environment                                | Metric             | Mix 1 | Mix 2 | Mix 3 | Mix 4 | Mix 5 | Mix 6 | Mix 7 |
| Static Threads                             | Total CPU (sec)    | 424   | 609   | 505   | 854   | 1,283 | 1,030 | 618   |
| Static Threads                             | Elapsed Time (sec) | 111   | 159   | 134   | 222   | 337   | 270   | 158   |
| Static Threads with MP_BLOCKTIME=1         | Total CPU (sec)    | 440   | 634   | 542   | 419   | 599   | 503   | 462   |
| Static Threads with MP_BLOCKTIME=1         | Elapsed Time (sec) | 116   | 170   | 145   | 110   | 172   | 158   | 123   |
| Dynamic Threads                            | Total CPU (sec)    | 324   | 495   | 418   | 350   | 555   | 496   | 333   |
| Dynamic Threads                            | Elapsed Time (sec) | 85    | 138   | 115   | 96    | 155   | 138   | 92    |
| Percent Savings: Dynamic vs. Static        | CPU Time           | 24%   | 19%   | 17%   | 59%   | 57%   | 52%   | 46%   |
| Percent Savings: Dynamic vs. Static        | Elapsed Time       | 23%   | 13%   | 14%   | 57%   | 54%   | 49%   | 42%   |
| Percent Savings: MP_BLOCKTIME=1 vs. Static | CPU Time           | -4%   | -4%   | -7%   | 51%   | 53%   | 51%   | 25%   |
| Percent Savings: MP_BLOCKTIME=1 vs. Static | Elapsed Time       | -5%   | -7%   | -8%   | 50%   | 49%   | 41%   | 22%   |
Dynamic threads greatly enhance overall performance in the job mixes analyzed. The range of system performance improvement observed in this experiment, from less than 15 percent to almost 60 percent, is representative of that enjoyed by several large commercial applications that use this capability. The overhead incurred by using dynamic threads is very small, typically limited to a couple of percent, so the benefits greatly outweigh the cost.
One of the points Table 3 illustrates is the impact of having slave processes block immediately after parallel regions by setting MP_BLOCKTIME=1 in conjunction with static threads. As you might expect, this helps when jobs are only partially parallel, but it hinders performance for highly parallel applications. For job mixes involving only the two partially parallel jobs, setting MP_BLOCKTIME to 1 resulted in performance almost as good as that delivered by dynamic threads. However, using MP_BLOCKTIME to control throughput performance lacks the generality of dynamic threads.
You can use the environment variables MP_SUGNUMTHD_MIN and MP_SUGNUMTHD_MAX to limit the range of processors that the dynamic threads capability uses. When you set MP_SUGNUMTHD_MIN to an integer value between 1 and the value of MP_SET_NUMTHREADS, the process does not decrease the number of threads below that value. When you set MP_SUGNUMTHD_MAX to an integer value between the minimum number of threads and MP_SET_NUMTHREADS, the process does not increase the number of threads above that value. In all of the job mixes presented in the experiment, the number of processors was allowed to vary across the entire range of 1 to the number of processors in the system.
If you set the environment variable MP_SUGNUMTHD_VERBOSE to any value, informational messages are written to stderr when the process changes the number of threads available to a job.
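For example, to keep a job between two and four threads and report the adjustments (C shell syntax; the specific bounds are illustrative):

    setenv MP_SET_NUMTHREADS 4
    setenv MP_SUGNUMTHD on            # enable dynamic threads
    setenv MP_SUGNUMTHD_MIN 2         # never drop below two threads
    setenv MP_SUGNUMTHD_MAX 4         # never rise above four threads
    setenv MP_SUGNUMTHD_VERBOSE on    # report thread-count changes on stderr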
Calls to mp_numthreads() and mp_set_numthreads() indicate that the application depends on the specific number of threads in use. Because of this, the number of threads available to a job is frozen upon either of these calls for the remainder of the job; and if MP_SUGNUMTHD_VERBOSE is set, a message to that effect is written to stderr.
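For illustration, a minimal sketch (the program and variable names are mine) of a query that freezes the thread count for the remainder of the run when dynamic threads are enabled:

      PROGRAM QUERY
      INTEGER NTHREADS, MP_NUMTHREADS
      EXTERNAL MP_NUMTHREADS
C     Querying the thread count tells the library that the code
C     depends on the specific number of threads, so with dynamic
C     threads enabled the count is frozen from this call onward.
      NTHREADS = MP_NUMTHREADS()
      WRITE (*,*) 'Running with ', NTHREADS, ' threads'
      END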
Another restriction is that a parallel job must encounter a number of parallel loops during execution for dynamic threads to be effective, because the number of threads made available to a job is altered only on entry to a parallel loop, never during the execution of a particular loop. The extreme example is a program with one parallel DO loop that is executed only once: such a program would use the specified maximum number of threads throughout that single loop's lifetime. In the example jobs presented here, from 75 to 150 parallel loops were executed.
It is also important that parallel loops be encountered in a well-distributed fashion over the total job run time. This enables the thread count to adjust more continuously to a varying system load. In practice, this rarely becomes an issue for large applications exploiting loop-level parallelism. Obviously, very short-running jobs have little to gain from dynamic threads when thread counts are adjusted only every few seconds.
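A structure like this hypothetical time-stepping fragment suits dynamic threads well, because every pass through the sequential outer loop re-enters the parallel loop and gives the library an opportunity to adjust the thread count:

      PROGRAM STEPPER
      INTEGER N, NSTEPS, I, ISTEP
      PARAMETER (N = 100000, NSTEPS = 200)
      REAL U(N), F(N), DT
      DT = 0.01
C$DOACROSS LOCAL(I)
      DO 5 I = 1, N
         U(I) = 0.0
         F(I) = 1.0
    5 CONTINUE
C     The outer time-step loop runs sequentially on the master thread;
C     the thread count may change at each entry to the inner parallel loop.
      DO 20 ISTEP = 1, NSTEPS
C$DOACROSS LOCAL(I)
         DO 10 I = 1, N
            U(I) = U(I) + DT * F(I)
   10    CONTINUE
   20 CONTINUE
      WRITE (*,*) 'U(1) = ', U(1)
      END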
The use of dynamic threads can dramatically improve system throughput and even job turnaround time on loaded systems. Moreover, the improvements are greater for partially parallelized applications. Because the overhead incurred is minimal, I recommend using these capabilities for all parallel applications that are not subject to the identified restrictions.
We welcome feedback and comments at devprogram@sgi.com.